Intermediate Growth Modeling with pcvr

DDPSC Data Science Core, August 2023

Josh Sumner

Outline

  • pcvr Goals
  • Load Package
  • Why Longitudinal Modeling?
  • Supported Model Builders
  • Supported Curves
  • growthSS
  • fitGrowth
  • growthPlot
  • testGrowth
  • Resources

pcvr Goals

Currently pcvr aims to:

  • Make common tasks easier and consistent
  • Make select Bayesian statistics easier

There is room for goals to evolve based on feedback and scientific needs.

Load package

Pre-work was to install R, Rstudio, and pcvr with dependencies.

devtools::load_all()# library(pcvr) # or devtools::load_all() if you are editing
library(ggplot2)
library(patchwork)

Why Longitudinal Modeling?

plantCV allows for user friendly high throughput image based phenotyping

Resulting data follows individuals over time, which changes our statistical needs.

Why Longitudinal Modeling? 2

Longitudinal Data is:

  • Autocorrelated
  • Often non-linear
  • Heteroskedastic

Why Longitudinal Modeling? 3

Why Longitudinal Modeling? 4

Why Longitudinal Modeling? 5

Supported Model Builders

Five model building options are supported through the type argument of growthSS:

nls, nlrq, nlme, mgcv, and brms

Other than mgcv all model builders can fit 9 types of growth models.

Supported Model Builders 2

“nls” “nlrq” “nlme” “mgcv” “brms”
stats::nls quantreg::nlrq nlme::nlme mgcv::gam brms::brms

type = “nls”

Non-linear least squares regression.

Longitudinal Trait nls
Non-linearity
Autocorrelation
Heteroskedasticity

type = “nlrq”

Linear Regression Quantile Regression
Predicts mean E(Y|X) Predicts quantiles Q(Y|X)
Works with small N Requires higher N
Assumes Normality No distributional assumptions
E(Y|X) breaks with transformation Q(Y|X) robust to transformation
Sensitive to outliers Robust to outliers
Computationally cheap Computationally more expensive

type = “nlrq”

Non-linear quantile regression.

Longitudinal Trait nlrq
Non-linearity
Autocorrelation
Heteroskedasticity

type = “nlme”

Non-linear Mixed Effect Models

Longitudinal Trait nlme
Non-linearity
Autocorrelation*
Heteroskedasticity
Being a headache

type = “mgcv”

General Additive Models Only

Longitudinal Trait gam
Non-linearity
Autocorrelation
Heteroskedasticity
Unparameterized

type = “brms”

Bayesian hierarchical Models

Longitudinal Trait brms
Non-linearity
Autocorrelation
Heteroskedasticity

Supported Growth Models

There are 6 main growth models supported in pcvr.

3 are asymptotic, 3 are non-asympototic.

Supported Growth Models

Supported Growth Models 2

There are also two double sigmoid curves intended for use with recovery experiments.

Supported Growth Models 3

Generalized Additive Models (GAMs) are also supported.

Survival Models

Survival models can also be specified using the “survival” keyword. These models can use the “survival” or “flexsurv” backends where distributions can be specified as model = "survival {distribution}".

For details please see the growthSS documentation.

GAMs

m <- mgcv::gam(y ~ group + s(time, by =factor(group)), data=simdf)
start<-min(simdf$time); end <-max(simdf$time)
support <- expand.grid(time = seq(start, end, length = 400),
                       group = factor(unique(simdf$group)))

out<-gam_diff(model=m, newdata=support, g1="a", g2="b",
              byVar = "group", smoothVar="time", plot=T, patch=F)

gam_diff predictions

gam_diff differences

growthSS

Any of these models shown can be specified easily using growthSS.

growthSS - form

With a model specified a rough formula is required to parse your data to fit the model.

The layout of that formula is:

outcome ~ time|individual/group

growthSS - form 2

Here we would use y~time|id/group

head(simdf)
    id group time         y
1 id_1     a    1 0.5860409
2 id_1     a    2 0.9368355
3 id_1     a    3 1.4960998
4 id_1     a    4 2.3853943
5 id_1     a    5 3.7936108
6 id_1     a    6 6.0089440

growthSS - form 3

We can check that the grouping in our formula is correct with a plot.

ggplot(simdf,aes(x=time, y=y,
                 group=paste(group,id)))+ 
  geom_line(aes(color=group))+ 
  labs(title="Testing Formula",
       subtitle = "y ~ time | id/group")+
  pcv_theme()

growthSS - sigma

“nlme” and “brms” models accept a sigma argument. Here we will only look at nlme models as brms models are the subject of the Advanced Growth Modeling tutorial.

growthSS - sigma

Recall the heteroskedasticity problem, shown again with our simulated data:

growthSS - sigma

There are lots of ways to model a trend like that we see for sigma.

pcvr offers three options through growthSS for nlme models.

growthSS - “none” sigma

Variance can be modeled as homoskedastic by group.

growthSS - “power” sigma

Variance can be modeled using a power of the x term.

growthSS - “exp” sigma

Variance can be modeled using a exponent of the x term.

growthSS - varFunc Options

nlme varFunc objects can also be passed to the sigma argument in growthSS.

See ?nlme::varClasses for details.

growthSS - start

One of the useful features in growthSS is that you do not need to specify starting values for the supported non-linear models (double sigmoid options notwithstanding).

growthSS - tau

Finally, with mode=“nlrq” the tau argument takes one or more quantiles to fit a model for. By default this is 0.5, corresponding to the median.

growthSS - nls

nls_ss <- growthSS(model="logistic", form =  y~time|id/group,
                   df=simdf, type = "nls")
lapply(nls_ss, class)
$formula
[1] "formula"

$start
[1] "list"

$df
[1] "data.frame"

$pcvrForm
[1] "formula"

$type
[1] "character"

$model
[1] "character"

$call
[1] "call"

growthSS - nlrq

nlrq_ss <- growthSS(model="logistic", form =  y~time|id/group,
                    df=simdf, type = "nlrq",
                    tau = seq(0.01, 0.99, 0.04))
lapply(nls_ss, class)
$formula
[1] "formula"

$start
[1] "list"

$df
[1] "data.frame"

$pcvrForm
[1] "formula"

$type
[1] "character"

$model
[1] "character"

$call
[1] "call"

growthSS - nlme

nlme_ss <- growthSS(model="logistic", form =  y~time|id/group,
                    df=simdf, sigma = "power", type = "nlme")
names(nlme_ss)
[1] "formula"  "start"    "df"       "pcvrForm" "type"     "model"    "call"    
names(nlme_ss$formula)
[1] "model"    "random"   "fixed"    "groups"   "weights"  "cor_form"

growthSS - mgcv

mgcv_ss <- growthSS(model="gam", form =  y~time|id/group,
                   df=simdf, type = "mgcv")
lapply(mgcv_ss, class)
$formula
[1] "formula"

$df
[1] "data.frame"

$pcvrForm
[1] "formula"

$type
[1] "character"

$model
[1] "character"

$call
[1] "call"

growthSS - survival models

surv_ss <- growthSS(model="survival weibull",
                    form =  y>100~time|id/group,
                   df=simdf, type = "survreg") # type = "flexsurv" has more distribution options
lapply(surv_ss, class)
$df
[1] "data.frame"

$formula
[1] "formula"

$pcvrForm
[1] "formula"

$distribution
[1] "character"

$type
[1] "character"

$model
[1] "character"

$call
[1] "call"

growthSS - survival models

surv_ss <- growthSS(model="survival weibull",
                    form =  y>100~time|id/group,
                   df=simdf, type = "survreg") # type = "flexsurv" has more distribution options
lapply(surv_ss, class)
$df
[1] "data.frame"

$formula
[1] "formula"

$pcvrForm
[1] "formula"

$distribution
[1] "character"

$type
[1] "character"

$model
[1] "character"

$call
[1] "call"

Note that here we still supply the standard phenotype data but give a cutoff on the left hand side of the formula. That cutoff represents the “event”. For example, given area in pixels germination might be area_px>10 ~ time|id/group.

fitGrowth

Now that we have the components for our model from growthSS we can fit the model with fitGrowth.

fitGrowth 2

With non-brms models the steps to fit a model specified by growthSS are very simple and can be left to fitGrowth.

nls_fit <- fitGrowth(nls_ss)
nlrq_fit <- fitGrowth(nlrq_ss, cores = 4)
nlme_fit <- fitGrowth(nlme_ss)
mgcv_fit <- fitGrowth(mgcv_ss)
surv_fit <- fitGrowth(surv_ss)

fitGrowth 3

Additional arguments can be passed to fitGrowth and are used as follows:

type
nls passed to stats::nls
nlrq cores to run in parallel, passed to quantreg::nlrq
nlme passed to nlme::nlmeControl
mgcv passed to mgcv::gam

growthPlot

Models fit by fitGrowth can be visualized using growthPlot.

growthPlot - nls

growthPlot - nlrq

growthPlot - nlme

growthPlot - mgcv

growthPlot - surv

Hypothesis Testing

Hypothesis testing for frequentist non-linear growth models can be somewhat limited.

Broadly, testing is implemented for all backends by comparing models against nested null models and for select backends non-linear testing is available using testGrowth.

testGrowth - nls

testGrowth(nls_ss, nls_fit, test = "A")$anova
Analysis of Variance Table

Model 1: y ~ A/(1 + exp((B[group] - time)/C[group]))
Model 2: y ~ A[group]/(1 + exp((B[group] - time)/C[group]))
  Res.Df Res.Sum Sq Df Sum Sq F value    Pr(>F)    
1    995     201112                                
2    994     168892  1  32220  189.63 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

testGrowth - nls 2

testGrowth(nls_ss, nls_fit, test = list("A1 - A2",
                                        "B1 - (B2*1.25)",
                                        "(C1+1) - C2"))
            Form   Estimate        SE   t-value      p-value
1        A1 - A2 40.5504232 2.4914371 16.275917 5.551605e-53
2 B1 - (B2*1.25) -1.7934648 0.2012856  8.910050 2.388216e-18
3    (C1+1) - C2  0.5332839 0.1340484  3.978294 7.444976e-05

testGrowth - nlrq

Here we only print out the comparison for the model of the 49th percentile, but all taus are returned.

testGrowth(nlrq_ss, nlrq_fit, test = "A")[[13]]
Model 1: y ~ A[group]/(1 + exp((B[group] - time)/C[group]))
Model 2: y ~ A/(1 + exp((B[group] - time)/C[group]))
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   6 -3971.0                         
2   5 -4101.7 -1 261.51  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

testGrowth - nlme

testGrowth(nlme_ss, nlme_fit, test = "A")$anova
        Model df      AIC      BIC    logLik   Test  L.Ratio p-value
nullMod     1 13 4930.142 4993.943 -2452.071                        
fit         2 16 4920.662 4999.186 -2444.331 1 vs 2 15.47946  0.0014

testGrowth - nlme 2

testGrowth(nls_ss, nlme_fit, test = list("A.groupa - A.groupb",
                                        "B.groupa - (B.groupb*1.25)",
                                        "(C.groupa+1) - C.groupa"))
                        Form  Estimate        SE  Z-value      p-value
1        A.groupa - A.groupb 39.202093 4.8025053 8.162842 3.272318e-16
2 B.groupa - (B.groupb*1.25) -1.806234 0.2039578 8.855920 8.299624e-19
3    (C.groupa+1) - C.groupa  1.000000 0.0000000      Inf 0.000000e+00

testGrowth - mgcv

Due to GAMs nature we cannot test parameters individually.

testGrowth(mgcv_ss, mgcv_fit)$anova
Analysis of Deviance Table

Model 1: y ~ s(time)
Model 2: y ~ 0 + group + s(time, by = group)
  Resid. Df Resid. Dev     Df Deviance      F    Pr(>F)    
1    992.53     311323                                     
2    985.19     169322 7.3344   142001 112.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

testGrowth - surv

The flexsurv backend provides more flexibility, but standard survreg models are tested using survival::survdiff

testGrowth(surv_ss, surv_fit)
Call:
survival::survdiff(formula = ss$formula, data = ss$df)

         N Observed Expected (O-E)^2/E (O-E)^2/V
group=a 20       20     16.2     0.893      2.73
group=b 20       20     23.8     0.608      2.73

 Chisq= 2.7  on 1 degrees of freedom, p= 0.1 

Final Considerations

  • Pick model builders and parameterizations based on your needs
  • Use GAMs sparingly, they present interpretability problems compared to the parameterized models.
  • Consider the biology behind your observed data while picking a growth model.

Resources

If you run into a novel situation please reach out and we will try to come up with a solution and add it to pcvr if possible.

Good ways to reach out are the help-datascience slack channel and pcvr github repository.